Rationale and Research Questions

Background

In the context of escalating global environmental challenges, the shift from traditional fossil fuel-based energy sources to renewable energy has become a focal point in efforts to reduce carbon emissions and combat climate change (Kabeyi and Olanrewaju, 2022). However, the transition’s broader environmental impacts, particularly on air quality, remain less explored. This transition is especially relevant given the increasing global energy demand and the need to meet this demand sustainably.

Significance

Understanding the relationship between renewable energy generation and air quality is crucial. Renewable energy sources like wind and solar are lauded for their lower environmental impact compared to fossil fuels, which are major contributors to air pollution (UCSUSA, 2018). Air pollution is a significant environmental hazard, affecting human health, ecosystems, and the climate. It is responsible for millions of premature deaths annually and contributes to the occurrence of diseases like asthma, heart disease, and lung cancer (National Geographic, 2023). Therefore, assessing how increased renewable energy generation affects air quality indicators is not just an environmental concern but a public health imperative.

Theoretical Context

The hypothesis underlying this research is that an increase in renewable energy generation leads to a reduction in air pollution. This hypothesis is grounded in the understanding that renewable energy sources, unlike fossil fuels, do not emit pollutants like sulfur dioxide, nitrogen oxides, and particulate matter during electricity generation.

Research Questions

1: What is the relationship between distributed renewable energy generation and the level of air pollution?

This question aims to investigate the correlation between the rise in renewable energy generation and the concentrations of various air pollutants. It seeks to understand whether regions with higher renewable energy output exhibit lower levels of air pollutants.

2: Among air quality indicators (PM10, PM2.5, CO, NO2, and SO2), which display the most significant response to variations in energy generation?

This question delves deeper into identifying which specific pollutants are most responsive to changes in energy generation types. It is crucial for pinpointing the environmental benefits of renewable energy sources and for policy-making aimed at targeted air pollution reduction.

Dataset Information

The exploratory analysis required combining air quality and power plant datasets. Air quality data were obtained from the United States Environmental Protection Agency (EPA), and power plant data were obtained from the U.S. Energy Information Administration (EIA). The sample period spans the two decades from 2001 to 2021.

Dataset Information for the Sample Period 2001 - 2021

Dataset | Source | Variables Used
Air Quality Summary Statistics by Criteria Pollutants and Location | EPA Air Quality System (AQS) | Monthly Mean Ozone and PM2.5
Power Plant Generator Level Capacities and Locations | EIA Form EIA-860 | Annual Installed Generation Capacity by Fuel Type
Power Plant Monthly Energy Generation | EIA Form EIA-923 | Monthly Net Generation by Fuel Type
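
As a minimal sketch of how two such sources could be combined, the snippet below joins a monthly generation table to a monthly pollutant table on state and date. The data frames and all values are hypothetical stand-ins, not the EIA or EPA extracts; only the column names (State, Date, NetGeneration, Pollutant, Mean) mirror those that appear later in this report.

```r
# Hypothetical toy tables; values are invented for illustration only.
generation <- data.frame(
  State = c("CA", "CA", "TX"),
  Date = as.Date(c("2001-01-01", "2001-02-01", "2001-01-01")),
  NetGeneration = c(100, 120, 80)
)
pollutant <- data.frame(
  State = c("CA", "CA", "TX", "IA"),
  Date = as.Date(c("2001-01-01", "2001-02-01", "2001-01-01", "2001-01-01")),
  Pollutant = "PM2.5",
  Mean = c(12.1, 11.4, 10.2, 9.8)
)

# Inner join on the shared keys: only state-months present in both tables survive.
combined <- merge(generation, pollutant, by = c("State", "Date"))

# Pollutant rows with no matching generation record (here, IA) are dropped by the join.
unmatched <- pollutant[!(paste(pollutant$State, pollutant$Date) %in%
                           paste(generation$State, generation$Date)), ]
```

Inspecting the unmatched rows before relying on the joined table is a cheap way to see how much data an inner join silently discards.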

Exploratory Analysis

We began our exploratory analysis by examining whether and how solar and wind generating capacity, net generation, and air quality have changed over time in the contiguous United States. We first used the wrangled power plant data to visualize, both quantitatively and spatially, the change in total installed solar and wind capacity over the period 2001 - 2021. The exploratory line plot shows that the three states with the highest cumulative installed capacity are California, Texas and Iowa, each of which shows significant growth in installed capacity compared to other states. To visualize the energy generation associated with these installations, we also plotted annual energy generation from solar and wind over the same period and noted that the states with the largest installed capacity are also the states with the highest annual generation from these renewable sources.

The change in installed solar and wind plants can also be visualized spatially across the contiguous United States, as illustrated below. The map shows that the number of installed solar and wind plants increased significantly over the two decades from 2001 to 2021.

Narrowing down to the three states with the highest growth in installed capacity, we used ggplot and gganimate to visualize the increase and distribution of plants within each state, as shown below:

After establishing that the installed capacity of solar and wind plants has increased over time in the United States, and particularly in California, Texas and Iowa, we explored the air quality data to establish how the concentrations of key criteria pollutants have changed over time. To do this, we visualized the wrangled air quality data over time for key pollutants related to fossil fuel generation, including SO2, NOX, and PM2.5. Based on the outputs shown below, the measured concentrations of these pollutants have been trending downwards over time.

Analysis

Question 1:

What is the relationship between distributed renewable energy generation and the level of air pollution?

We can formulate null and alternative hypotheses for the above research question as follows:

H0: There is no change in recorded air quality with an increase in renewable energy generation in the states of California, Texas and Iowa over the period 2001 - 2021.

Ha: There is a change in recorded air quality with an increase in renewable energy generation in the states of California, Texas and Iowa over the period 2001 - 2021.
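
One way a pair of hypotheses like this could be tested formally is with a two-sided correlation test. The sketch below is an illustration under stated assumptions: the series are synthetic (252 invented monthly values, matching the 21-year window), and the variable names net_generation and mean_pm25 are hypothetical, not objects from this analysis.

```r
# Synthetic monthly series for one hypothetical state (252 months = 21 years).
set.seed(42)
net_generation <- seq(1e5, 3e6, length.out = 252)
mean_pm25 <- 12 - 1e-6 * net_generation + rnorm(252, sd = 2)

# Pearson correlation test: H0 is zero correlation, Ha is non-zero correlation.
test <- cor.test(net_generation, mean_pm25, alternative = "two.sided")

# Reject H0 at the 5% level when the p-value falls below 0.05.
reject_h0 <- test$p.value < 0.05
```

Rejecting H0 in a test of this kind shows association only. Because generation and pollutant levels both trend over time, a shared time trend can produce the same result.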

To evaluate this hypothesis, we generated a plot of Mean PM2.5 measured against net monthly solar and wind generation for all three states.

The figures above suggest that the measured pollutant concentrations have an inverse relationship, or negative correlation, with net generation from solar and wind. This suggests that the higher the amount of energy generated from wind and solar, the lower the concentrations of the three criteria pollutants. To investigate this further, we performed a simple linear regression of the mean concentration of each pollutant on net energy generation for each state, with the results shown in the regression output below:

## 
## Call:
## lm(formula = MeanPM25 ~ NetGeneration, data = df_CA.energy.air.data.PM2.5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.3063 -2.3934 -0.9329  1.0735 24.1288 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.200e+01  3.343e-01  35.905  < 2e-16 ***
## NetGeneration -7.729e-07  1.575e-07  -4.907 1.67e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.675 on 250 degrees of freedom
## Multiple R-squared:  0.08784,    Adjusted R-squared:  0.08419 
## F-statistic: 24.07 on 1 and 250 DF,  p-value: 1.672e-06
## 
## Call:
## lm(formula = MeanPM25 ~ NetGeneration, data = df_TX.energy.air.data.PM2.5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1238 -1.1574 -0.1915  1.1334  7.2755 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.079e+01  1.613e-01  66.883  < 2e-16 ***
## NetGeneration -2.771e-07  3.781e-08  -7.329  3.2e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.746 on 250 degrees of freedom
## Multiple R-squared:  0.1769, Adjusted R-squared:  0.1736 
## F-statistic: 53.72 on 1 and 250 DF,  p-value: 3.197e-12
## 
## Call:
## lm(formula = MeanPM25 ~ NetGeneration, data = df_IA.energy.air.data.PM2.5)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1263 -1.7642 -0.1452  1.1926 10.7097 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.111e+01  2.235e-01  49.733  < 2e-16 ***
## NetGeneration -1.283e-06  1.557e-07  -8.239 9.78e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.405 on 250 degrees of freedom
## Multiple R-squared:  0.2136, Adjusted R-squared:  0.2104 
## F-statistic: 67.88 on 1 and 250 DF,  p-value: 9.784e-15
## 
## Call:
## lm(formula = MeanSO2 ~ NetGeneration, data = df_CA.energy.air.data.SO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.60143 -0.27853 -0.06666  0.23838  1.28392 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.233e+00  3.179e-02   38.78   <2e-16 ***
## NetGeneration -2.121e-07  1.498e-08  -14.16   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3494 on 250 degrees of freedom
## Multiple R-squared:  0.4451, Adjusted R-squared:  0.4429 
## F-statistic: 200.6 on 1 and 250 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = MeanSO2 ~ NetGeneration, data = df_TX.energy.air.data.SO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.86002 -0.43834 -0.00575  0.30639  1.92076 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.349e+00  4.912e-02  27.468  < 2e-16 ***
## NetGeneration -9.488e-08  1.152e-08  -8.238 9.88e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5318 on 250 degrees of freedom
## Multiple R-squared:  0.2135, Adjusted R-squared:  0.2103 
## F-statistic: 67.86 on 1 and 250 DF,  p-value: 9.879e-15
## 
## Call:
## lm(formula = MeanSO2 ~ NetGeneration, data = df_IA.energy.air.data.SO2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.46099 -0.53054 -0.04837  0.46070  2.43395 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    2.119e+00  6.810e-02   31.11   <2e-16 ***
## NetGeneration -6.421e-07  4.744e-08  -13.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7329 on 250 degrees of freedom
## Multiple R-squared:  0.4229, Adjusted R-squared:  0.4206 
## F-statistic: 183.2 on 1 and 250 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = MeanNO2 ~ NetGeneration, data = df_CA.energy.air.data.NOX)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9093 -2.2522  0.0044  1.9175  7.2354 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    1.466e+01  2.632e-01   55.70   <2e-16 ***
## NetGeneration -1.826e-06  1.240e-07  -14.72   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.893 on 250 degrees of freedom
## Multiple R-squared:  0.4644, Adjusted R-squared:  0.4622 
## F-statistic: 216.7 on 1 and 250 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = MeanNO2 ~ NetGeneration, data = df_TX.energy.air.data.NOX)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.4112 -1.8887 -0.1455  1.6983  5.9992 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    9.526e+00  2.164e-01  44.025   <2e-16 ***
## NetGeneration -4.695e-07  5.073e-08  -9.254   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.343 on 250 degrees of freedom
## Multiple R-squared:  0.2551, Adjusted R-squared:  0.2522 
## F-statistic: 85.63 on 1 and 250 DF,  p-value: < 2.2e-16
## 
## Call:
## lm(formula = MeanNO2 ~ NetGeneration, data = df_IA.energy.air.data.NOX)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.6786 -1.4912 -0.1341  1.4061  6.4046 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)    8.393e+00  1.871e-01   44.87   <2e-16 ***
## NetGeneration -1.501e-06  1.295e-07  -11.59   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.988 on 247 degrees of freedom
## Multiple R-squared:  0.3522, Adjusted R-squared:  0.3496 
## F-statistic: 134.3 on 1 and 247 DF,  p-value: < 2.2e-16
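
The nine fits above all report negative NetGeneration slopes, with R-squared values between roughly 0.09 and 0.46. A compact way to collect such statistics into one table is sketched below. The data are synthetic and the helper fit_one is hypothetical; it is an illustration of the pattern, not the code that produced the output above.

```r
# Synthetic stand-in data: three hypothetical states, 252 months each.
set.seed(1)
states <- c("CA", "TX", "IA")
toy <- do.call(rbind, lapply(states, function(s) {
  gen <- seq(1e5, 3e6, length.out = 252)
  data.frame(State = s,
             NetGeneration = gen,
             MeanPM25 = 12 - 1e-6 * gen + rnorm(252, sd = 2))
}))

# Fit one simple linear regression per state and keep the key statistics.
fit_one <- function(d) {
  s <- summary(lm(MeanPM25 ~ NetGeneration, data = d))
  data.frame(State = d$State[1],
             Slope = s$coefficients["NetGeneration", "Estimate"],
             PValue = s$coefficients["NetGeneration", "Pr(>|t|)"],
             RSquared = s$r.squared)
}
results <- do.call(rbind, lapply(split(toy, toy$State), fit_one))
```

A table of slope, p-value and R-squared per state makes it easier to compare fits side by side than reading nine separate summaries.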
# Fit a multiple regression of PM2.5 on month, year and net generation (California)
airQualityAIC <- lm(MeanPM25 ~ Month + Year + NetGeneration,
                    data = df_CA.energy.air.data.PM2.5)

# Choose a model by AIC with a backward stepwise algorithm
step(airQualityAIC)
## Start:  AIC=640.19
## MeanPM25 ~ Month + Year + NetGeneration
## 
##                 Df Sum of Sq    RSS    AIC
## - NetGeneration  1    13.932 3110.7 639.32
## <none>                       3096.7 640.19
## - Year           1    57.554 3154.3 642.83
## - Month          1   226.376 3323.1 655.97
## 
## Step:  AIC=639.32
## MeanPM25 ~ Month + Year
## 
##         Df Sum of Sq    RSS    AIC
## <none>               3110.7 639.32
## - Month  1    228.85 3339.5 655.21
## - Year   1    362.43 3473.1 665.09
## 
## Call:
## lm(formula = MeanPM25 ~ Month + Year, data = df_CA.energy.air.data.PM2.5)
## 
## Coefficients:
## (Intercept)        Month         Year  
##    407.3051       0.2761      -0.1980
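
In the stepwise output above, dropping NetGeneration lowers the AIC from 640.19 to 639.32, so the selected California PM2.5 model keeps only Month and Year. The sketch below shows how the same kind of nested comparison could be made by hand. It uses synthetic data in which PM2.5 follows a time trend and generation merely rises with time; every number in it is invented for illustration.

```r
# Synthetic data: PM2.5 depends on Year only; generation is nearly collinear with Year.
set.seed(7)
toy <- data.frame(Year = rep(2001:2021, each = 12),
                  Month = rep(1:12, times = 21))
toy$NetGeneration <- 1e5 * (toy$Year - 2000) + rnorm(252, sd = 2e4)
toy$MeanPM25 <- 14 - 0.2 * (toy$Year - 2000) + rnorm(252, sd = 2)

full <- lm(MeanPM25 ~ Month + Year + NetGeneration, data = toy)
reduced <- lm(MeanPM25 ~ Month + Year, data = toy)

# extractAIC() returns c(edf, AIC) on the scale that step() prints.
aic_full <- extractAIC(full)[2]
aic_reduced <- extractAIC(reduced)[2]

# Partial F-test for the same nested comparison.
f_test <- anova(reduced, full)
```

When a predictor is strongly correlated with Year, as generation is likely to be over a period of steady capacity growth, the two can be hard to separate, which is one possible reading of why NetGeneration is dropped.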

Interpretation of the Boxplots

PM2.5 Levels: The distribution of PM2.5 levels varies widely between states. California shows a particularly wide range of PM2.5 concentrations with notable outliers, indicating episodes of very poor air quality. Possible drivers of California’s variability, such as wildfires or urban pollution, merit further investigation.

CO Levels: CO levels are relatively uniform across the states, with fewer outliers compared to PM2.5. This could indicate a more consistent source of CO pollution, such as traffic, across these states.

NO2 Levels: NO2 levels are somewhat variable, with Illinois showing a higher median concentration. This could be associated with industrial activities or high traffic density.

PM10 Levels: PM10 shows a spread similar to PM2.5, with California again showing high variability and outliers. This suggests common sources of particulate matter affecting both PM10 and PM2.5.

SO2 Levels: The distribution of SO2 is quite tight in most states, except for a few outliers. This pollutant is often associated with industrial processes and the burning of sulfur-containing fuels.
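
The visual reading of spread and outliers above can be backed by numbers. The following is a minimal sketch on synthetic data: the three states and all values are invented, and two extreme readings are planted for CA to mimic episodic spikes.

```r
# Synthetic PM2.5 readings for three hypothetical states; CA gets two
# invented extreme values to mimic episodic spikes.
set.seed(3)
toy <- data.frame(
  State = rep(c("CA", "IL", "TX"), each = 50),
  Mean = c(rnorm(48, 12, 2), 45, 60, rnorm(50, 11, 2), rnorm(50, 10, 2))
)

# boxplot.stats() flags points beyond 1.5 * IQR from the hinges,
# the same rule a default boxplot uses to draw outliers.
outliers <- lapply(split(toy$Mean, toy$State), function(x) boxplot.stats(x)$out)

# Interquartile range per state as a simple measure of spread.
iqr_by_state <- tapply(toy$Mean, toy$State, IQR)
```

Counting flagged points and comparing interquartile ranges per state gives a quantitative counterpart to statements such as "high variability" or "few outliers".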

library(dplyr)
library(ggplot2)

# Identify the ten states with the largest total net generation
top_states <- df_generation_monthly %>%
  group_by(State) %>%
  summarise(TotalGeneration = sum(NetGeneration)) %>%
  top_n(10, TotalGeneration) %>%
  pull(State)

# Time series plot of renewable energy generation in the top ten states
ggplot(df_generation_monthly %>% filter(State %in% top_states),
       aes(x = Date, y = NetGeneration, group = State, color = State)) +
  geom_line() +
  labs(title = "Renewable Energy Generation Over Time in Top 10 States",
       x = "Date",
       y = "Net Generation (MWh)") +
  theme_minimal()


Interpretation

The line graph above displays the trend of renewable energy generation over time in the top 10 states. There is a general upward trend in renewable energy generation across all states, indicating increased adoption and capacity over time. California (CA) stands out with significantly higher generation, especially with a steep increase around 2020. Other states also show growth in renewable energy generation, but to varying degrees. For instance, Texas (TX) and Iowa (IA) show notable increases. The variability in generation over time could be influenced by factors like state policies, technological advancements, and investment in renewable energy infrastructure. The overall increasing trend aligns with global efforts to transition to cleaner energy sources to reduce reliance on fossil fuels and combat climate change.
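
An upward trend like the one described above can be quantified by aggregating monthly values to annual totals and fitting a line through them. The sketch below assumes a synthetic monthly series for a single hypothetical state; the growth rate and noise level are invented.

```r
# Synthetic monthly net generation for one hypothetical state, 2001 to 2021.
set.seed(5)
toy <- data.frame(Date = seq(as.Date("2001-01-01"), as.Date("2021-12-01"),
                             by = "month"))
toy$Year <- as.integer(format(toy$Date, "%Y"))
toy$NetGeneration <- 1e5 + 500 * seq_along(toy$Date) + rnorm(nrow(toy), sd = 1e3)

# Sum monthly generation within each year.
annual <- aggregate(NetGeneration ~ Year, data = toy, FUN = sum)

# Slope of a linear fit: average change in annual generation per year.
trend <- lm(NetGeneration ~ Year, data = annual)
slope_per_year <- coef(trend)[["Year"]]
```

Annual totals smooth out the seasonal swings in solar and wind output, so the slope reflects long-run growth instead of month-to-month variation.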

# Check unique states in both datasets
unique(df_generation_monthly$State)
##  [1] AK AZ CA CO DE FL HI IA ID IL IN KS MA MD ME MI MN MO MT NC ND NE NH NJ NM
## [26] NV NY OH OK OR PA RI SD TN TX UT VA VT WA WI WV WY AL AR CT DC GA KY LA MS
## [51] SC
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA MA ... WY
unique(df_pollutant_monthly$State)
##  [1] Alabama              Alaska               Arizona             
##  [4] Arkansas             California           Colorado            
##  [7] Connecticut          Country Of Mexico    Delaware            
## [10] District Of Columbia Florida              Georgia             
## [13] Hawaii               Idaho                Illinois            
## [16] Indiana              Iowa                 Kansas              
## [19] Kentucky             Louisiana            Maine               
## [22] Maryland             Massachusetts        Michigan            
## [25] Minnesota            Mississippi          Missouri            
## [28] Montana              Nebraska             Nevada              
## [31] New Hampshire        New Jersey           New Mexico          
## [34] New York             North Carolina       North Dakota        
## [37] Ohio                 Oklahoma             Oregon              
## [40] Pennsylvania         Puerto Rico          Rhode Island        
## [43] South Carolina       South Dakota         Tennessee           
## [46] Texas                Utah                 Vermont             
## [49] Virginia             Washington           West Virginia       
## [52] Wisconsin            Wyoming              Canada              
## [55] Virgin Islands      
## 55 Levels: Alabama Alaska Arizona Arkansas California Canada ... Wyoming
state_abbreviations <- c(AL = "Alabama", AK = "Alaska", AZ = "Arizona", AR = "Arkansas", CA = "California", 
                         CO = "Colorado", CT = "Connecticut", DE = "Delaware", FL = "Florida", GA = "Georgia", 
                         HI = "Hawaii", ID = "Idaho", IL = "Illinois", IN = "Indiana", IA = "Iowa", 
                         KS = "Kansas", KY = "Kentucky", LA = "Louisiana", ME = "Maine", MD = "Maryland", 
                         MA = "Massachusetts", MI = "Michigan", MN = "Minnesota", MS = "Mississippi", MO = "Missouri", 
                         MT = "Montana", NE = "Nebraska", NV = "Nevada", NH = "New Hampshire", NJ = "New Jersey", 
                         NM = "New Mexico", NY = "New York", NC = "North Carolina", ND = "North Dakota", 
                         OH = "Ohio", OK = "Oklahoma", OR = "Oregon", PA = "Pennsylvania", RI = "Rhode Island", 
                         SC = "South Carolina", SD = "South Dakota", TN = "Tennessee", TX = "Texas", UT = "Utah", 
                         VT = "Vermont", VA = "Virginia", WA = "Washington", WV = "West Virginia", WI = "Wisconsin", 
                         WY = "Wyoming")


# Convert full state names to postal abbreviations; names without a match
# in state_abbreviations (e.g. Canada, Puerto Rico) become NA
df_pollutant_monthly$State <- names(state_abbreviations)[
  match(as.character(df_pollutant_monthly$State), state_abbreviations)]
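
Note that the lookup vector above has no entry for the District of Columbia, although DC appears in the generation data, so DC pollutant rows are mapped to NA. As a hedged alternative, base R ships the constants state.name and state.abb for the 50 states, and DC can be appended by hand. The sample vector state_names below is hypothetical.

```r
# state.name and state.abb are built-in base R constants covering the 50 states.
lookup <- c(setNames(state.abb, state.name),
            "District Of Columbia" = "DC")  # spelling as printed in the pollutant data

# Hypothetical sample of names, including non-state entries seen in the output above.
state_names <- c("California", "Texas", "District Of Columbia", "Canada", "Puerto Rico")

# Vectorized lookup; names without an entry come back as NA.
abbr <- unname(lookup[state_names])
```

Entries that come back as NA (here Canada and Puerto Rico) can then be filtered out explicitly before merging, which makes the exclusion visible in the code.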


df_merged_top_states <- merge(df_generation_monthly %>% filter(State %in% top_states), 
                              df_pollutant_monthly %>% filter(State %in% top_states), 
                              by = c("State", "Date"))

str(df_merged_top_states)
## 'data.frame':    12070 obs. of  8 variables:
##  $ State        : Factor w/ 51 levels "AK","AL","AR",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Date         : Date, format: "2001-01-01" "2001-01-01" ...
##  $ Year         : int  2001 2001 2001 2001 2001 2001 2001 2001 2001 2001 ...
##  $ Month        : int  1 1 1 1 1 2 2 2 2 2 ...
##  $ NetGeneration: num  137203 137203 137203 137203 137203 ...
##  $ Pollutant    : Factor w/ 5 levels "Carbon monoxide",..: 4 3 5 2 1 1 5 2 3 4 ...
##  $ Unit         : Factor w/ 4 levels "Micrograms/cubic meter (25 C)",..: 2 1 3 3 4 4 3 3 1 2 ...
##  $ Mean         : num  25.1 33.42 1.72 21.63 1.06 ...
# Creating a list of unique pollutants in our dataset
unique_pollutants <- unique(df_merged_top_states$Pollutant)

# Creating a list to store the plots
scatter_plots <- list()

# Looping through each unique pollutant and creating a scatter plot
for (pollutant in unique_pollutants) {
  plot_data <- df_merged_top_states[df_merged_top_states$Pollutant == pollutant, ]
  
  plot <- ggplot(plot_data, aes(x = NetGeneration, y = Mean, color = State)) +
    geom_point() +
    labs(title = paste("Net Generation vs.", pollutant),
         x = "Net Generation",
         y = "Pollutant Concentration",
         color = "State") +
    theme_minimal()
  
  scatter_plots[[pollutant]] <- plot
}

# Printing the scatter plots
scatter_plots
[Figures: scatter plots of Net Generation vs. Pollutant Concentration, colored by State, one plot each for PM2.5 - Local Conditions, PM10 Total 0-10um STP, Sulfur dioxide, Nitrogen dioxide (NO2), and Carbon monoxide]

Interpretation

PM2.5 and Net Generation: There doesn’t seem to be a clear relationship between net generation and PM2.5 levels. While some states with higher net generation have lower PM2.5 levels, the data is scattered, suggesting other factors may be at play in determining PM2.5 concentrations.

PM10 and Net Generation: Similar to PM2.5, the PM10 concentrations do not show a clear trend in relation to net generation. There is significant scatter across all levels of net generation.

Sulfur Dioxide (SO2) and Net Generation: The plot for SO2 shows a dense clustering of lower SO2 levels at higher net generation, which might indicate that higher renewable energy generation could be associated with lower SO2 concentrations.

Nitrogen Dioxide (NO2) and Net Generation: The scatter plot for NO2 presents a wide distribution of pollutant concentrations across the net generation axis, indicating no strong correlation between the two.

Carbon Monoxide (CO) and Net Generation: The CO levels are spread across the generation axis, but there is a noticeable cluster of lower CO concentrations at higher levels of net generation.

In summary, of the scatter plots for PM2.5, PM10, NO2, SO2, and CO in relation to renewable energy generation, only those for Sulfur Dioxide (SO2) and Carbon Monoxide (CO) show some indication of a relationship. For SO2, higher levels of renewable energy generation seem to correspond with a clustering of lower pollutant concentrations. Similarly, the scatter plot for CO shows a clustering of data points towards lower CO levels as net generation increases. These observations suggest an inverse relationship in which increased renewable energy generation could be associated with lower ambient concentrations of SO2 and CO, pollutants typically associated with the combustion of fossil fuels.
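
The visual comparison across pollutants could be complemented by a rank correlation computed per pollutant, which is unit-free and so allows pollutants measured in different units to be compared. The sketch below uses synthetic long-format data with the same column names as df_merged_top_states; the effect sizes are invented and are not estimates from this report.

```r
# Synthetic long-format data mirroring the columns of df_merged_top_states.
set.seed(11)
n <- 200
gen <- runif(n, 1e5, 3e6)
toy <- rbind(
  data.frame(Pollutant = "Sulfur dioxide", NetGeneration = gen,
             Mean = 2 - 5e-7 * gen + rnorm(n, sd = 0.2)),
  data.frame(Pollutant = "PM2.5", NetGeneration = gen,
             Mean = 11 + rnorm(n, sd = 3))
)

# Spearman rank correlation between generation and concentration, per pollutant.
rho <- sapply(split(toy, toy$Pollutant), function(d)
  cor(d$NetGeneration, d$Mean, method = "spearman"))

# Order pollutants from most negative to least negative association.
ranking <- sort(rho)
```

Because the real data pool several states, a per-state version of this calculation would help separate within-state association from differences between states.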

Question 1:

What is the relationship between distributed renewable energy generation and the level of air pollution?

Question 2:

Among air quality indicators (PM10, PM2.5, CO, NO2, and SO2), which display the most significant response to variations in energy generation?

Summary and Conclusions

References

Kabeyi, Moses Jeremiah Barasa, and Oludolapo Akanni Olanrewaju. ‘Sustainable Energy Transition for Renewable and Low Carbon Grid Electricity Generation and Supply’. Frontiers in Energy Research, vol. 9, 2022. https://www.frontiersin.org/articles/10.3389/fenrg.2021.743114.

Union of Concerned Scientists. ‘Environmental Impacts of Renewable Energy Technologies’. https://www.ucsusa.org/resources/environmental-impacts-renewable-energy-technologies. Accessed 6 Dec. 2023.

National Geographic. ‘Air Pollution’. https://education.nationalgeographic.org/resource/air-pollution. Accessed 6 Dec. 2023.